Deoxyribonucleic Acid

A  FEW  HISTORICAL  LANDMARKS 


1869  Miescher  isolates  DNA 


1944  DNA  carries  the  genetic  information  (Avery) 


1953 The double helix structure of DNA is discovered
by Watson and Crick
→ a simple model for the transmission of the
genetic information


1966 Nirenberg, Ochoa and Khorana elucidate the
genetic code

→ DNA codes for proteins

codon       ATG          GCG  ACG  ...  GCC  GTG  TAA
amino acid  Met (start)  Ala  Thr  ...  Ala  Val  (stop)


DeoxyriboNucleic Acid: Two Views

• Double helix macromolecule

•  Each  strand  consists  of  an  oriented  sequence  of  four 
possible  nucleotides: 

Adenine,  Thymine,  Guanine  &  Cytosine 

•  Complementary  strands: 

[A]=[T]  &  [G]=[C]  over  the  sum  of  both  strands 
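
The complementarity rule can be checked on a short example. The following is a minimal sketch, not part of the original slides: the helper names `reverse_complement` and `duplex_counts` are hypothetical, and the test strand is simply a short illustrative string.

```python
from collections import Counter

# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Return the complementary strand, read in its own orientation."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def duplex_counts(strand):
    """Count each nucleotide over the sum of both strands."""
    return Counter(strand) + Counter(reverse_complement(strand))

strand = "ATGGCGACGAAGCT"  # illustrative strand (assumption)
counts = duplex_counts(strand)
print(reverse_complement(strand))
print(counts["A"] == counts["T"], counts["G"] == counts["C"])
```

By construction every A on one strand faces a T on the other, so the equalities hold over the duplex whatever the composition of a single strand.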


Organization of the human genome

[Figure: a gene with Exon 1 to Exon 4 undergoing transcription, then translation]

23 chromosomes, L ~ 100 Mbp
  Genes (20%), L ~ 10 kbp
    Introns (INTervening seq.), L ~ 1 kbp
    Exons (Expressed seq.), L ~ 150 bp
    After translation: L ~ 500 AA
  Non-genic DNA


Sequencing  projects  result  in  4  letter  texts  : 


gtcagtttcctgaggcgggtcgggacccaggcgtgagactggagtctgcc 

caggggcccagctgagccagcctcctcgtcagctgcttgggccgccagga 

cgccgccgggggtgcgccgcgcttccctggatggggtgcccccactcccc 

tcggagccccagggagaccccccgaactcagctcctctcaggggtgccag 

ggggacccctcaaactccactccccgcaggttcctggggagacgccccct 

gctcgattcccctcagggtcccagggagaccccctaattcagctcctctc 

aggggtactgggggacctctcgagctccactcccatcagggtcccaggga 

gaccccccaactatgctcaggggtcccagggagatgccagcaccccaact 

ccgcttccctggggcccccctccccttacagctcaacttccctcgagagt 

ctggggctggggctccgttcagttcttgagtccccttccctcggggtgtc 

ccggggccgcccacccccacactgtctgtgattccccaaggcgcgggtct 

cgggccgcagcctgttccacgttctgctgctcgttcttttctggctcctt 

gctttcgaaggagagaaggaggccttcgtttccagtctttttgccttttc 

taatggagccctgcttttccttccgtgtcccttcaggctacttctgccag 

gtttctatttttcattctttattatgacttcgcccaaaatattcttgact 

tctattgagaaggattcgggggtctatttcttattcggaggcgtgtgctt 

aagttccaaacagatgaggattttccagttaatccttctggggtgactta 

ttgcttaatgccaccatagccagaaaatggactctcagtgtccgaaactg 

cattcggctctgaagtgtctgtccttgtcacctcttgcaatgtttcgcgg 

cgggaagcctgcactcgccgacgctgacgtaactgtttctgtctttcagg 

tctacagcctcctgtgggtgggcgatattgacatatactttatttctata 

tatgttatgaactcaatatttcttgcagcgggtctgctgataataagata 

tgcctactctgcgagtctggaagccatcttaagcttaccctgtatgtgcc 

ccatgcatctcttccgttacacggctcctgagttgacacctgtgtgataa 

actggtaatagcaagtaaactgttttcttgtgctctgtaagctgctctag 

caaattatctaggaggaggtggtcttggaaacccctgatttataagcggg 

cagtcagcagtacacgtggcccagaatcgtgattggcatttgaagtgggg 

gcagtagggtgggactgagcccttcacctgtggggtctgccctgctcaag 

gcagtgtcagaattgaagtgaaatgttggacggtcggtgtccagagagtt 

ggagaactggtttgtgtgtaaaaactnacatatttagggtcagaagtatg 


HIERARCHICAL  STRUCTURE 
OF  EUCARYOTIC  DNA 
NET RESULT: EACH DNA MOLECULE HAS BEEN
PACKAGED INTO A MITOTIC CHROMOSOME THAT IS
50,000x SHORTER THAN ITS EXTENDED LENGTH


Different  ways  to  read  the  text 


I.  “Classical”  reading 

•  Looking  for  patterns 

—  Genes,  introns,  exons  detection 

— Splicing sites, promoters, replication origins recognition

•  Characterizing  repetitions 

—  Tandem,  interspersed  repeats 

-  Oligonucleotide  usage 

•  Using  methods  such  as 

-  Hidden  Markov  chains 
—  Fourier  transform 

—  Dot-plot  matrices  and  recurrence  plots 


INVARIANCE  UNDER  TRANSLATION 


II.  The  physicist  reading 


• Hypothesis: The DNA text results from a stochastic
process:

ACGTTCGAT ?

• Question: The choice of the next nucleotide:

i. Depends on a finite number ($l_0$) of the previous
trials

→ Short range correlations and exponential decay
of the correlation function:

$C(l) \propto \exp(-l/l_0)$

ii. Depends on all the previous nucleotides

→ Long range correlations and power law decay
of the correlation function:

$C(l) \propto l^{-\gamma}$
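
To make the question concrete, the correlation function can be estimated directly from a numerically coded text. This is a minimal sketch under stated assumptions: the estimator `autocorrelation` is a hypothetical helper, the purine/pyrimidine coding (+1/-1) is one possible choice, and the test text is a synthetic uncorrelated sequence rather than real DNA.

```python
import random

def autocorrelation(steps, lag):
    """Estimate C(l) = <s(i) s(i+l)> - <s(i)><s(i+l)> at the given lag."""
    n = len(steps) - lag
    head, tail = steps[:n], steps[lag:]
    mean_head = sum(head) / n
    mean_tail = sum(tail) / n
    return sum(a * b for a, b in zip(head, tail)) / n - mean_head * mean_tail

# Assumed coding: purine = +1, pyrimidine = -1.
CODING = {"A": 1, "G": 1, "T": -1, "C": -1}

random.seed(0)
text = "".join(random.choice("ACGT") for _ in range(20000))  # uncorrelated text
steps = [CODING[base] for base in text]
for lag in (1, 10, 100):
    print(lag, round(autocorrelation(steps, lag), 4))
```

For an uncorrelated text the estimates fluctuate around zero at every lag; a slow, power-law decay with the lag would instead point to case ii.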


INVARIANCE  UNDER  DILATATION 


DNA WALK REPRESENTATION (Peng et al. 92)


1. Each nucleotide is associated with a numerical value
(A to a, T to t, G to g and C to c).

purine-pyrimidine : a = g = 1 and t = c = -1
weak-strong : a = t = 1 and g = c = -1
amino-keto : a = c = 1 and t = g = -1

A-non A : a = 1 and t = g = c = -1/3
T-non T : t = 1 and a = g = c = -1/3
G-non G : g = 1 and a = t = c = -1/3
C-non C : c = 1 and a = t = g = -1/3

2. Suppose you have a walker on the line. The value associated
with the i-th nucleotide defines the i-th step S(i) of the
walker
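
The two steps above can be sketched in a few lines. This is an illustrative sketch only: the dictionary `CODINGS` and the function `dna_walk` are hypothetical names, and only four of the seven codings are spelled out.

```python
from itertools import accumulate

# Step values for some of the codings listed above.
CODINGS = {
    "purine-pyrimidine": {"A": 1, "G": 1, "T": -1, "C": -1},
    "weak-strong": {"A": 1, "T": 1, "G": -1, "C": -1},
    "amino-keto": {"A": 1, "C": 1, "T": -1, "G": -1},
    "A-non A": {"A": 1, "T": -1 / 3, "G": -1 / 3, "C": -1 / 3},
}

def dna_walk(sequence, coding):
    """Return the walker position after each nucleotide (cumulative sum of steps)."""
    steps = [coding[base] for base in sequence.upper()]
    return list(accumulate(steps))

walk = dna_walk("ATGGCGACGAAGCT", CODINGS["purine-pyrimidine"])
print(walk)
```

The final position equals the number of purines minus the number of pyrimidines, which is why a compositional bias shows up as an overall drift of the walk.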


Example using the purine (↑) / pyrimidine (↓) distinction:


A T G G C G A C G A A G C T
↑ ↓ ↑ ↑ ↓ ↑ ↑ ↓ ↑ ↑ ↑ ↑ ↓ ↓
[Figure: DNA walks for an exon of the human PKD1 gene and an intron of the human dystrophin gene; horizontal axis from 0 to 1000]


Most of the physicists' work amounts to characterizing
the roughness of a DNA walk landscape


FRACTAL SIGNALS

ROUGHNESS EXPONENT


[Figure: rough profile f(x) over the interval 0 ≤ x ≤ L]


• Root-mean-square of the height fluctuations:

$W(L) = \sqrt{\langle f^2(x) \rangle - \langle f(x) \rangle^2} \sim L^H$

H = roughness exponent

•  Random  walk 

•  0.5  <  H  <  1  LONG  RANGE  CORRELATIONS  (LRC) 

•  H  =  0.5  UNCORRELATED 

•  0  <  H  <  0.5  ANTI-CORRELATIONS 


• Fractal dimension

$D_f = 2 - H$


• Power spectrum

$S_f(k) \sim k^{-(2H+1)}$


• Correlation function

$C_f(l) = \langle f(x) f(x+l) \rangle - \langle f(x) \rangle^2 \sim l^{2H}$
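
As an illustration of the scaling of the root-mean-square fluctuations with L, the roughness exponent can be estimated as a log-log slope. The sketch below is an assumption-laden example, not the method used in the slides: the window sizes, the averaging over non-overlapping windows, and the helper names `rms_fluctuation` and `roughness_exponent` are all illustrative choices.

```python
import math
import random

def rms_fluctuation(profile, window):
    """Average over non-overlapping windows of sqrt(<f^2> - <f>^2)."""
    widths = []
    for start in range(0, len(profile) - window + 1, window):
        chunk = profile[start:start + window]
        mean = sum(chunk) / window
        variance = sum((v - mean) ** 2 for v in chunk) / window
        widths.append(math.sqrt(variance))
    return sum(widths) / len(widths)

def roughness_exponent(profile, windows):
    """Least-squares slope of log W(L) versus log L."""
    xs = [math.log(w) for w in windows]
    ys = [math.log(rms_fluctuation(profile, w)) for w in windows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Uncorrelated +1/-1 random walk: the estimate should be close to H = 0.5.
random.seed(1)
walk, position = [], 0
for _ in range(2 ** 16):
    position += random.choice((-1, 1))
    walk.append(position)
print(round(roughness_exponent(walk, [16, 32, 64, 128, 256, 512]), 2))
```

Note that this simple estimator does not remove trends, so a biased uncorrelated walk would yield an inflated exponent.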


Are  the  observed  LRC  a  bias  in  the  measurement  ? 


Is the mosaic structure of DNA enough to account for the
(possibly misleading) LRC observed in DNA sequences?

Karlin and Brendel 93:


[Figure: DNA walks of bacteriophage λ and of the human β-myosin gene, plotted against position n]


A specific analysing tool is needed to avoid confusing
a biased uncorrelated random walk with an
unbiased correlated random walk


WAVELET  ANALYSIS  OF  FRACTAL  SIGNALS 


$T_g(a,b) = \frac{1}{a} \int g\left(\frac{x-b}{a}\right) f(x)\, dx$


Mathematical  microscope 


“Singularity scanner”

The wavelet transform allows us to LOCATE (b) the
singularities of f and to ESTIMATE (a) their strength h(x)
(Hölder exponent)


[Figure axes: log2(a) and T_g(a0, x)]
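
The "singularity scanner" idea can be tried on a step function, whose single singularity sits at the jump. This is a minimal sketch under assumptions not stated in the slides: the analyzing wavelet is taken to be the first derivative of a Gaussian, the integral is replaced by a sum over integer positions with a 1/a prefactor, and the signal is extended by its edge values; `gaussian_derivative` and `wavelet_transform` are hypothetical names.

```python
import math

def gaussian_derivative(u):
    """First derivative of exp(-u^2/2); an assumed choice of analyzing wavelet."""
    return -u * math.exp(-0.5 * u * u)

def wavelet_transform(signal, scale, wavelet=gaussian_derivative):
    """Discretized T(a, b) = (1/a) * sum_x g((x - b)/a) f(x) for every integer b."""
    n = len(signal)
    reach = int(5 * scale)  # the Gaussian envelope is negligible beyond 5 scales
    out = []
    for b in range(n):
        total = 0.0
        for x in range(b - reach, b + reach + 1):
            clamped = min(max(x, 0), n - 1)  # extend the signal by its edge values
            total += wavelet((x - b) / scale) * signal[clamped]
        out.append(total / scale)
    return out

# A unit step between positions 199 and 200.
step = [0.0] * 200 + [1.0] * 200
for scale in (4, 8, 16):
    modulus = [abs(t) for t in wavelet_transform(step, scale)]
    b_max = max(range(len(modulus)), key=modulus.__getitem__)
    print(scale, b_max, round(modulus[b_max], 3))
```

The position of the modulus maximum locates the jump at every scale, and the fact that its height barely changes with the scale is the signature of a singularity of strength h = 0.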


CONTINUOUS WAVELET TRANSFORM OF THE
TRIADIC DEVIL’S STAIRCASE


[Figure panels: (a) the devil’s staircase $F(x) = \int_{-\infty}^{x} d\mu(x')$ on [0, 1]; wavelet transform representation; wavelet transform modulus maxima (WTMM); WTMM skeleton of the triadic Cantor set]


F(x) is continuous but non-differentiable. F’(x) = 0 almost everywhere.
Its continuous variation occurs over a set of Lebesgue measure 0
and dimension D_F = log 2 / log 3
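
For readers who want to experiment, the triadic devil's staircase can be evaluated from the base-3 expansion of x. This is a sketch under assumptions: the function name `devils_staircase` and the truncation depth are illustrative, and the uniform measure on the triadic Cantor set is assumed.

```python
import math

def devils_staircase(x, depth=40):
    """Cantor function on [0, 1], computed from the ternary digits of x."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, weight = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:
            # x lies in a removed middle third: F is constant there.
            return value + weight
        value += weight * (digit // 2)  # ternary digit 2 becomes binary digit 1
        weight *= 0.5
    return value

print(devils_staircase(0.5), devils_staircase(0.25))
print(math.log(2) / math.log(3))  # dimension of the supporting Cantor set
```

The plateaus on the removed middle thirds are where F'(x) = 0; all the variation is concentrated on the remaining Cantor set.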


Fractal  measures 


*  Invariant  measures  associated  with  the  strange  attractors 
of  discrete  dynamical  systems 

*  Turbulent  energy  dissipation 
Fractal signals


* Weierstrass functions

* Fractional Brownian motions

* Turbulent signals

DEVIL’S STAIRCASE


$F(x) = \int_{-\infty}^{x} d\mu(x')$ (characteristic function of $\mu$)


WAVELET ANALYSIS OF THE DNA SEQUENCE OF THE BACTERIOPHAGE λ
SYNTHETIC DNA SEQUENCES


[Figure: an uncorrelated random sequence and a long-range correlated random sequence, each shown for w = 32 bp and w = 512 bp; horizontal axis ticks at 5000, 10^4 and 1.5 x 10^4]
SYNTHETIC DNA WALKS

Fractional Brownian motions: B_H


H  =  0.3  anti-correlated 


H  =  0.7  long-range  correlated 


A  UNIQUE  WAY  TO  DISPLAY  RESULTS 


1. Straight line ⇔ scale invariance properties


2. The slope of a linear behavior gives the roughness
exponent H

H = 0.5: no LRC
H > 0.5: LRC




LRC  AND  THE  ISOCHORE  STRUCTURE  OF 
WARM  BLOODED  VERTEBRATES 
LRC increase with the G + C content of isochores

This  result  remains  valid  for  genomes  that  don’t  possess  an 
isochore  structure  ! 


WHICH BIOLOGICAL MECHANISMS CAN
ACCOUNT FOR LRC IN DNA SEQUENCES?

• Genome dynamics and plasticity

Point mutation
Insertion, deletion
Transposition

Duplication of exons, genes or chromosomes
Recombination

Generalized Lévy walk model (Buldyrev et al. 93)

Length distribution of protein coding segments (Herzel and Große 97)

• Compaction constraints - Access to information

Nucleosome
Chromatin fiber

Higher  order  folding  up  to  the  metaphase  chromosome 

Fractal  model  of  chromosomes  (Takahashi  89) 

Crumpled  globule  model  (Grosberg  et  al.  93) 


[Figure axis label: h(q,n) - 0.6 log10 n]


Statistical analysis of the
eukaryotic genome of Saccharomyces cerevisiae


Universality  between  the  16  chromosomes  of  yeast 
Universality  between  the  4  mononucleotidic  codings 
nc  ~  200bp  is  a  characteristic  length  scale 


Yeast  chromosome  I 


Gaussian statistics at small scales (n < 200 bp)

Non-Gaussian (fat tails) statistics at large scales (n > 200 bp)


Statistical  analysis  of  the  bacterial 
genome  of  Escherichia  coli 


Universality  between  the  4  mononucleotidic  codings  and  with 
the  eukaryotic  genome  of  yeast 

nc  ~  200bp  is  a  characteristic  length  scale 


Gaussian statistics at small scales (n < 200 bp): H = 0.5

Non-Gaussian (fat tails) statistics at large scales (n > 200 bp):
H = 0.75


DNA WALKS THAT REFLECT THE
STRUCTURE OF THE DNA POLYMER


[Figure: (a) geometry of the DNA polymer; the surviving labels give one angle of 34.3° and a second angle of 0°]


2  trinucleotide  codings  based  on  experiments  : 


Trinucleotide   PNuc   DNase I
AAA/TTT         0.0    0.1
AAC/GTT         3.7    1.6
AAG/CTT         5.2    4.2
AAT/ATT
1. Nucleosome positioning model (PNuc)


related to curvature ?

2. DNase I digestion data

related to bending propensity


[Figure: panels for S. cerevisiae and E. coli; (-) DNA text, (o) PNuc, (■) DNase I; vertical axis h(q,n) - 0.6 log10 n]

Hypothesis: LRC in the small-scale regime is the signature of
the nucleosomal structure


Eucaryotes (nucleosomes): Drosophila melanogaster, Arabidopsis thaliana
No nucleosomes: Treponema pallidum

(-) DNA text, (o) PNuc, (■) DNase I


Small  scales  LRC  are  related  to 

NUCLEOSOME  LIKE  STRUCTURES 


Epstein-Barr virus


Bacteriophage  T4 


Pox viruses do not display LRC in the small-scale regime


Archaeoglobus  fulgidus  Pyrococcus  horikoshii 


Archaebacteria  display  LRC  in  the  small  scale  regime 


AFM  visualisation  of  a  reconstituted 

chromatin  fiber 

Pierre-Louis  Porte,  Emeline  Fontaine,  Cendrine  Moskalenko 
Digital Instruments NanoScope
LARGE SCALE REPRESENTATION
OF  GENOMIC  SEQUENCES 


Space-Scale  Representation  of  the  GC  Content 
with  a  Smoothing  Gaussian  Filter 
Filtering scales: a = 40 kb and a = 160 kb
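
A Gaussian smoothing of the GC content at a chosen scale can be sketched as follows. This is an illustrative sketch, not the implementation behind the figures: the helper names `gc_indicator` and `gaussian_smooth`, the kernel truncation at four scales, and the renormalization near the sequence ends are all assumptions.

```python
import math

def gc_indicator(sequence):
    """1.0 where the nucleotide is G or C, 0.0 elsewhere."""
    return [1.0 if base in "GC" else 0.0 for base in sequence.upper()]

def gaussian_smooth(values, scale):
    """Weighted local average with a Gaussian kernel of width `scale` (in bp)."""
    reach = int(4 * scale)  # truncate the kernel at 4 scales (assumption)
    offsets = range(-reach, reach + 1)
    kernel = [math.exp(-0.5 * (k / scale) ** 2) for k in offsets]
    n = len(values)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for k, weight in zip(offsets, kernel):
            j = i + k
            if 0 <= j < n:  # renormalize near the ends instead of padding
                num += weight * values[j]
                den += weight
        smoothed.append(num / den)
    return smoothed

profile = gaussian_smooth(gc_indicator("ATATATGCGCGCGCATATAT" * 5), scale=3)
print([round(v, 2) for v in profile[:10]])
```

Repeating the smoothing for increasing scales gives a stack of profiles, that is, a space-scale picture of the GC content. At genomic scales such as tens of kb, a direct double loop like this one is slow and an FFT-based convolution would normally be preferred.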


Space-scale  content:  S(a)  =  a)|, 

where $\psi_M$ is the Morlet wavelet
Opening of the double helix with a different
environment for each strand => asymmetrical process


Symmetrical  properties  of  the  strands: 
“Parity  Rule  type  2” 


[A]  =  [T]  &  [G]  =  [C] 

in  each  strand 
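
Deviations from this within-strand symmetry are commonly measured by compositional skews. The sketch below is an assumption, not taken from the slides: it uses the normalized differences (T - A)/(T + A) and (G - C)/(G + C) computed on a single strand, and the function name `strand_skews` is hypothetical.

```python
def strand_skews(sequence):
    """Return (TA skew, GC skew) for one strand; both vanish under Parity Rule 2."""
    seq = sequence.upper()
    a, t = seq.count("A"), seq.count("T")
    g, c = seq.count("G"), seq.count("C")
    ta_skew = (t - a) / (t + a) if (t + a) else 0.0
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    return ta_skew, gc_skew

print(strand_skews("ATGC"))
print(strand_skews("GGGCAT"))
```

Computing these skews in sliding windows along a chromosome gives a skew profile whose departures from zero reveal strand asymmetries.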
Compositional skew due to local biases in a strand in
Our model: well defined replication origins, separated by
diffuse termini
Profile detection using an analyzing wavelet
adapted to the shape of replicons


Deterministic  Chaos  in  DNA  Sequences 
Equation of a non-linear oscillator
which displays homoclinic chaos of Shil'nikov's type:

$\dddot{\theta} + \mu_2 \ddot{\theta} + \mu_1 \dot{\theta} + \mu_0 \theta + k \theta^3 = 0$

θ and t were rescaled so that the chaotic trajectory displays similar
amplitude and characteristic frequencies to the skew oscillatory profiles.